03/12/2019 - 05/12/2019

Overview

Day 1

  • Exploratory Data Analysis: 11:10-12:30
  • Linear Models: 13:15-16:30

Day 2

  • Generalized Linear Models Part I: 9:30-12:30
  • Generalized Linear Models Part II: 13:30-16:30

Day 3

  • Intro to Bayesian Analysis Part I: 9:30-12:30
  • Intro to Bayesian Analysis Part II: 13:30-15:00

Point System

Small prize at the end of the week for the person with the most points

  • 1 pt per spelling mistake
  • 3 pts per mistake in the code
  • 5 pts per mistake fixed in the code

To be tallied at the end of each day.

Road Map of the Modelling Process


  1. Identify the data science question you wish to explore

    • At a general, conceptual level: “Are bed nets an effective intervention against malaria?”
    • At a specific level: “What is the percentage reduction in malaria cases between area A, where bed nets are being used, and area B, where bed nets are not being used?”
  2. Explore the data

  3. Choose the model

  4. Fit parameters

  5. Estimate confidence intervals/test hypotheses/select models

Exploratory Data Analysis

  • Hypothesis generation
  • Data exploration
  • Formal statistical testing

Iris data

Source: Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties.

Tidy Data

Tidy data is tabular data in which each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

  • A variable is a quantity, quality, or property that you can measure.

  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

  • An observation is a set of measurements made under similar conditions. An observation will contain several values, each associated with a different variable.
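These definitions can be made concrete with a small, made-up table in pandas (the values here are invented for illustration; the real dataset is introduced below):

```python
import pandas as pd

# A tiny tidy table: each column is a variable, each row an observation,
# and each cell a single value. The numbers are invented for illustration.
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.7],                # continuous variable
    'species': ['setosa', 'setosa', 'virginica'],   # categorical variable
})

print(df.shape)                    # (3, 2): 3 observations, 2 variables
print(df.loc[0, 'sepal_length'])   # a single value: 5.1
```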

Iris data as a pandas dataframe

Who can spot the:

  • variable?
  • value?
  • observation?

Getting to know your data

To get started, let’s explore the following questions for our dataset.

  1. What is the structure of the data?

  2. What type of variation occurs within my variables?

  3. What type of covariation occurs between my variables?

Data Structure and Data Summaries

  • Continuous variable: a variable that can take on an unlimited number of values between the lowest and highest points of measurement.

    • e.g. speed, distance, height
  • Categorical variable: a variable that can take one of a limited set of values.

    • e.g. gender, marital status, county
  • A categorical variable is

    • `nominal` if its values have no order (e.g. 'Ghana' and 'Uruguay')

    • `ordinal` if there is an order associated with its values (e.g. 'low', 'medium', and 'high')
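The nominal/ordinal distinction can be expressed directly with pandas categoricals; a minimal sketch with made-up values:

```python
import pandas as pd

# Nominal: no inherent order between the categories.
country = pd.Categorical(['Ghana', 'Uruguay', 'Ghana'])
print(country.ordered)            # False

# Ordinal: an explicit order makes comparisons (and min/max) meaningful.
level = pd.Categorical(['low', 'high', 'medium'],
                       categories=['low', 'medium', 'high'],
                       ordered=True)
print(level.ordered)              # True
print(level.min(), level.max())   # low high
```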

Your Turn

Run the lines of code to answer the questions below

  1. What are the dimensions of the dataframe?

  2. What are the first and last values of sepal length?

  3. Which variables are float64s or categories?

  4. Using the data summary, what is the minimum and maximum sepal length?

  5. What are the names of the columns?

1. What are the dimensions of the dataframe?

print(df_iris.shape)
## (150, 5)

2a. What are the first values of sepal length?

df_iris.head(5)
##    sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
## 0                5.1               3.5  ...               0.2  setosa
## 1                4.9               3.0  ...               0.2  setosa
## 2                4.7               3.2  ...               0.2  setosa
## 3                4.6               3.1  ...               0.2  setosa
## 4                5.0               3.6  ...               0.2  setosa
## 
## [5 rows x 5 columns]

2b. What are the last values of sepal length?

df_iris.tail(5)
##      sepal length (cm)  sepal width (cm)  ...  petal width (cm)     target
## 145                6.7               3.0  ...               2.3  virginica
## 146                6.3               2.5  ...               1.9  virginica
## 147                6.5               3.0  ...               2.0  virginica
## 148                6.2               3.4  ...               2.3  virginica
## 149                5.9               3.0  ...               1.8  virginica
## 
## [5 rows x 5 columns]

3. Which variables are float64s or categories?

df_iris.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 5 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## target               150 non-null category
## dtypes: category(1), float64(4)
## memory usage: 5.0 KB

4. Using the data summary, what is the minimum and maximum sepal length?

df_iris.describe()
##        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## count         150.000000        150.000000         150.000000        150.000000
## mean            5.843333          3.057333           3.758000          1.199333
## std             0.828066          0.435866           1.765298          0.762238
## min             4.300000          2.000000           1.000000          0.100000
## 25%             5.100000          2.800000           1.600000          0.300000
## 50%             5.800000          3.000000           4.350000          1.300000
## 75%             6.400000          3.300000           5.100000          1.800000
## max             7.900000          4.400000           6.900000          2.500000

5. What are the names of the columns?

# Let's simplify the column names and make them more meaningful

df_iris = df_iris.rename(columns = {'sepal length (cm)': 'sepal_length', 
      'sepal width (cm)': 'sepal_width', 'petal length (cm)': 'petal_length', 
      'petal width (cm)': 'petal_width','target': 'species'})

# What are the names of the columns?

df_iris.columns
## Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
##        'species'],
##       dtype='object')

Variation and Covariation

Variation and covariation help us understand the spread of the data and identify potential relationships in the data.

  • Variation: the tendency of the values of a variable to change from measurement to measurement.
    • measurement error: you may measure the same thing twice and get slightly different values
    • natural variation in a population or sample (e.g. height)
  • Covariation: the tendency of the values of a variable to change with the values of another variable.
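pandas can quantify both ideas directly; a sketch with invented numbers, where y is roughly twice x:

```python
import pandas as pd

# Invented data: y is roughly twice x.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.1, 3.9, 6.2, 7.8]})

# Variation: the spread of each variable about its own mean.
print(df.var())

# Covariation: how x and y change together. Correlation rescales the
# covariance to lie in [-1, 1].
print(df.cov())
print(df.corr())
```

Here `df.corr()` is close to 1, reflecting the strong positive covariation built into the data.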

Visualising a Categorical Variable

species_counts = sns.countplot(x= "species", data = df_iris)
species_counts

Visualising a continuous variable

petal_length_all_distplot = sns.distplot(df_iris['petal_length'], 
              hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
petal_length_all_distplot.set(xlabel='Petal_length', ylabel='Density')
petal_length_all_distplot;

Visualising a continuous variable

## [Text(0, 0.5, 'Density'), Text(0.5, 0, 'Petal_length')]

Something interesting seems to be happening with the data

Visualising a continuous variable

df_setosa = df_iris[df_iris.species == 'setosa']
petal_length_species = sns.distplot(df_setosa['petal_length'], label = 'setosa', 
              hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})

df_virginica = df_iris[df_iris.species == 'virginica']
petal_length_species = sns.distplot(df_virginica['petal_length'], label = 'virginica', 
              hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})

df_versicolor = df_iris[df_iris.species == 'versicolor']
petal_length_species = sns.distplot(df_versicolor['petal_length'], label = 'versicolor', 
              hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
petal_length_species.set(xlabel='Petal Length', ylabel='Density x 10')

petal_length_species;

Visualising a continuous variable

## [Text(0, 0.5, 'Density x 10'), Text(0.5, 0, 'Petal Length')]

Continuous and a categorical variable

petal_width_boxplot = sns.boxplot(data = df_iris, y = 'petal_width', x = 'species')
petal_width_boxplot

Continuous and a categorical variable

sepal_length_violin = sns.violinplot(data = df_iris, y = "sepal_length", x = 'species')
sepal_length_violin

Visualising two continuous variables

Plotting two continuous variables, we can see how they change in relation to each other.

  • positive relationship: as one variable increases, the other variable increases

  • negative relationship: as one variable increases, the other decreases

  • no relationship: no discernible pattern of change in one variable with the other

  • non-linear relationship: we may also be able to pick out other patterns, e.g. polynomials

Visualising two continuous variables

iris_scatter_petal_length_sepal_width = sns.scatterplot(data = df_iris, x = 'petal_length',
      y = 'sepal_width', hue = 'species')
iris_scatter_petal_length_sepal_width

Your Turn

Exercises

  1. Make a box plot or a violin plot of sepal width by species. How does this box plot/violin plot compare to the earlier box plot/violin plot we made of petal width and sepal length?

  2. Make a scatterplot to visualise the relationship between petal length and sepal length coloured by species. What patterns can you pick out from the data?

Exercise: Sepal width by species

sepal_width_violin = sns.violinplot(data = df_iris, y = 'sepal_width', x = 'species')
sepal_width_violin

Exercise: Petal length by sepal length

iris_length_scatter = sns.scatterplot(data = df_iris, x = 'petal_length', 
          y = 'sepal_length', hue = 'species')
iris_length_scatter

Your Turn

  1. Pairplots can be a quick and useful way to summarise your dataset and to inspect all the pairwise relationships simultaneously. What does this code do?
iris_all_pairplot = sns.pairplot(data = df_iris, hue="species", diag_kind="kde")
iris_all_pairplot

Model Basics

Hypothesis generation vs. hypothesis confirmation

The focus of modelling is on inference, i.e. on confirming whether a hypothesis is true.

  1. Each observation can be used either for exploration or for confirmation, not both.
  2. You can use an observation for exploration as many times as you want.
  3. You can only use an observation once for confirmation.
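One way to honour rule 1 is to split the rows once, up front; a hypothetical sketch:

```python
import pandas as pd

# Invented data: 100 observations identified by obs_id.
df = pd.DataFrame({'obs_id': range(100)})

# Split once: exploration rows may be reused freely;
# confirmation rows are touched exactly once.
explore = df.sample(frac=0.6, random_state=0)
confirm = df.drop(explore.index)

print(len(explore), len(confirm))                        # 60 40
print(explore.index.intersection(confirm.index).empty)   # True: no overlap
```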

Model basics

A goal of models is to partition data into patterns and residuals.

  1. Family of models:
    • Expresses a precise, but generic, pattern you wish to capture.
    • E.g., a straight line \(y = ax + b\) or a quadratic curve \(y = ax^2 + bx + c\).
  2. Fitted model:
    • A member of the model family with specific parameter values.
    • In model fitting, the parameters take on the values that best adjust the shape of the fitted line to the data.
    • N.B. The fitted model is just the closest model from a family of models.
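The family/fitted-model distinction fits in a few lines: the family is \(y = ax + b\), and fitting picks concrete values for \(a\) and \(b\). A sketch on invented data (here `np.polyfit` stands in for the least-squares fitting we do with statsmodels later):

```python
import numpy as np

# Invented data generated from a = 2, b = 1 plus a little noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Fitting: choose the member of the family y = a*x + b closest to the data.
a_hat, b_hat = np.polyfit(x, y, deg=1)
print(round(a_hat, 2), round(b_hat, 2))  # close to the true 2.0 and 1.0
```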

Family of models and fitted models

Response and Explanatory variables

  • Response variable: the measured variable you are trying to explain
    • Dependent, target (machine learning)
  • Explanatory variables: measured variables that we use to try to explain the response variable.
    • Independent, features (machine learning)

Linear Models

Linear models take the mathematical form:

\(y = ax + b\)

  • \(y\) is the response variable
  • \(a\) is the slope
  • \(x\) is an explanatory variable
  • \(b\) is the intercept.

Selling Irises

A data scientist sees a market for selling irises to nostalgic ex-statistics undergraduates.

Source: Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties.

Irises with the widest petals sell best (largest petal_width).

  • What is the relationship between petal width and petal length? How do these relationships vary with species?

Petal length and petal width

iris_scatter_petal_length_width = sns.scatterplot(data = df_iris, x = 'petal_length', 
    y = 'petal_width', hue = 'species')
iris_scatter_petal_length_width

Plotting the relationship

  • \(y = 0.3x + 0.1\)
iris_scatter_petal_length_width = sns.scatterplot(data = df_iris, x = 'petal_length', 
    y = 'petal_width', hue = 'species')

x = np.arange(7)
a = 0.3
b = 0.1
y = a*x + b


iris_scatter_petal_length_width = sns.regplot(x=x, y=y, marker="+")
iris_scatter_petal_length_width

Plotting the relationship

iris_scatter_petal_length_width = sns.scatterplot(data = df_iris, x = 'petal_length', 
    y = 'petal_width', hue = 'species')

x = np.arange(7)
a = 0.37
a1 = 0.35
b = -0.2
b1 = -0.3
y = a*x + b
y1 = a1*x + b1

iris_scatter_petal_length_width = sns.regplot(x=x, y=y, marker="+")
iris_scatter_petal_length_width = sns.regplot(x=x, y=y1, marker="+", 
    line_kws={"color": "red"})
iris_scatter_petal_length_width

Plotting the relationship

Model Construction

  • We use the statsmodels formula api
    • Formula specified as: `formula = 'y ~ x'`
    • N.B. smf.ols automatically includes the intercept.
model1 = smf.ols(formula = 'petal_width ~ petal_length', data = df_iris)

results_mod1 = model1.fit()

Fitted Model 1 Output

##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:            petal_width   R-squared:                       0.927
## Model:                            OLS   Adj. R-squared:                  0.927
## Method:                 Least Squares   F-statistic:                     1882.
## Date:                Tue, 21 Apr 2020   Prob (F-statistic):           4.68e-86
## Time:                        11:12:59   Log-Likelihood:                 24.796
## No. Observations:                 150   AIC:                            -45.59
## Df Residuals:                     148   BIC:                            -39.57
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## Intercept       -0.3631      0.040     -9.131      0.000      -0.442      -0.285
## petal_length     0.4158      0.010     43.387      0.000       0.397       0.435
## ==============================================================================
## Omnibus:                        5.765   Durbin-Watson:                   1.455
## Prob(Omnibus):                  0.056   Jarque-Bera (JB):                5.555
## Skew:                           0.359   Prob(JB):                       0.0622
## Kurtosis:                       3.611   Cond. No.                         10.3
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fitted Model 1 Output

Model Results.

Extracting coefficients from the model output

We can get these specific parameters directly from the model object.

print("Intercept = ",results_mod1.params['Intercept'])
## Intercept =  -0.36307552131902865
print("(Petal length) coef. = ", results_mod1.params['petal_length'])
## (Petal length) coef. =  0.4157554163524114
print("R^2 = ", results_mod1.rsquared_adj)
## R^2 =  0.9266173379025906

Constructing the line of best fit

iris_scatter = sns.scatterplot(data = df_iris, x = 'petal_length', 
    y = 'petal_width', hue = 'species')

x = np.arange(7)
a = 0.37; a1 = 0.35

b = -0.2; b1 = -0.3

y = a*x + b; y1 = a1*x + b1
y2 = results_mod1.params['petal_length']*x + results_mod1.params['Intercept']

iris_scatter = sns.regplot(x = x, y = y, marker="+")
iris_scatter = sns.regplot(x = x, y = y1, marker="+", 
      line_kws = {"color": "red"})
iris_scatter = sns.regplot(x = x, y = y2, marker = "+", 
      line_kws = {"color": "blue"})
iris_scatter

Constructing the line of best fit

Constructing the line of best fit using regplot

iris_best_fit = sns.regplot(x = 'petal_length', y = 'petal_width', 
      ci = 95, data = df_iris)
iris_best_fit

Constructing the line of best fit using regplot

Residuals

  • Residuals represent the leftover variation in the response variable not explained by the model.
    • We want to see an ‘amorphous’ cloud of points.
  • Patterns in the residuals may indicate
    • a missing variable
    • that the variation in our response variable is not normally distributed.

Residuals for Model 1

df_iris_resid = sns.residplot(y = 'petal_width', x = 'petal_length', data = df_iris)
df_iris_resid

Checking for Normality of Residuals

QQplot for checking Normality

resid = results_mod1.resid
# Use statsmodels qqplot to graph residuals
f, ax = plt.subplots(figsize=(7,7))
sm.graphics.qqplot(resid, line='45', fit=True, ax=ax);

Model 2 with petal length and species

  • Explaining the variation in petal_width using petal_length and C(species)
model2 = smf.ols(formula = 'petal_width ~ petal_length + C(species)', data = df_iris)

results_mod2 = model2.fit()

print(results_mod2.summary())

Model 2 with petal length and species

##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:            petal_width   R-squared:                       0.946
## Model:                            OLS   Adj. R-squared:                  0.944
## Method:                 Least Squares   F-statistic:                     845.5
## Date:                Tue, 21 Apr 2020   Prob (F-statistic):           4.88e-92
## Time:                        11:13:02   Log-Likelihood:                 46.704
## No. Observations:                 150   AIC:                            -85.41
## Df Residuals:                     146   BIC:                            -73.37
## Df Model:                           3                                         
## Covariance Type:            nonrobust                                         
## ============================================================================================
##                                coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------------------
## Intercept                   -0.0908      0.056     -1.611      0.109      -0.202       0.021
## C(species)[T.versicolor]     0.4354      0.103      4.234      0.000       0.232       0.639
## C(species)[T.virginica]      0.8377      0.145      5.764      0.000       0.550       1.125
## petal_length                 0.2304      0.034      6.691      0.000       0.162       0.298
## ==============================================================================
## Omnibus:                        6.210   Durbin-Watson:                   1.736
## Prob(Omnibus):                  0.045   Jarque-Bera (JB):                9.578
## Skew:                          -0.110   Prob(JB):                      0.00832
## Kurtosis:                       4.218   Cond. No.                         54.0
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Residuals of Model 2

## [Text(0, 0.5, 'Residuals'), Text(0.5, 0, 'Fitted')]

QQPlot of Model 2

Model 3 petal length * species interaction code

  • Interactions are specified using `*` between two variables; in the formula language, `a*b` expands to `a + b + a:b`.
model3 = smf.ols(formula = 'petal_width ~ petal_length*C(species)', data = df_iris)

results_mod3 = model3.fit()

print(results_mod3.summary())

Model 3 petal length * species interaction output

##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:            petal_width   R-squared:                       0.948
## Model:                            OLS   Adj. R-squared:                  0.946
## Method:                 Least Squares   F-statistic:                     521.9
## Date:                Tue, 21 Apr 2020   Prob (F-statistic):           2.34e-90
## Time:                        11:13:03   Log-Likelihood:                 49.695
## No. Observations:                 150   AIC:                            -87.39
## Df Residuals:                     144   BIC:                            -69.33
## Df Model:                           5                                         
## Covariance Type:            nonrobust                                         
## =========================================================================================================
##                                             coef    std err          t      P>|t|      [0.025      0.975]
## ---------------------------------------------------------------------------------------------------------
## Intercept                                -0.0482      0.215     -0.225      0.823      -0.473       0.376
## C(species)[T.versicolor]                 -0.0361      0.315     -0.114      0.909      -0.659       0.587
## C(species)[T.virginica]                   1.1843      0.334      3.544      0.001       0.524       1.845
## petal_length                              0.2012      0.146      1.380      0.170      -0.087       0.490
## petal_length:C(species)[T.versicolor]     0.1298      0.156      0.835      0.405      -0.178       0.437
## petal_length:C(species)[T.virginica]     -0.0409      0.153     -0.268      0.789      -0.343       0.261
## ==============================================================================
## Omnibus:                        6.984   Durbin-Watson:                   1.804
## Prob(Omnibus):                  0.030   Jarque-Bera (JB):               11.955
## Skew:                          -0.085   Prob(JB):                      0.00254
## Kurtosis:                       4.373   Cond. No.                         176.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Line of best fit for Model 3 code slope

iris_scatter_species = sns.scatterplot(data = df_iris, x = 'petal_length', 
y = 'petal_width', hue = 'species')

x = np.arange(7)

a_setosa = results_mod3.params['petal_length']

a_virginica = results_mod3.params['petal_length:C(species)[T.virginica]'] + results_mod3.params['petal_length']

a_versicolor = results_mod3.params['petal_length:C(species)[T.versicolor]'] + results_mod3.params['petal_length']

b_setosa = results_mod3.params['Intercept']

b_virginica = results_mod3.params['C(species)[T.virginica]'] + results_mod3.params['Intercept']

b_versicolor = results_mod3.params['C(species)[T.versicolor]'] + results_mod3.params['Intercept']

Line of best fit for Model 3 code intercept and plot

y_setosa = a_setosa*x + b_setosa
y_virginica = a_virginica*x + b_virginica
y_versicolor = a_versicolor*x + b_versicolor

iris_scatter_species = sns.regplot(x = x, y = y_setosa, marker= "+", 
      line_kws = {"color": "blue"})
      
iris_scatter_species = sns.regplot(x = x, y = y_virginica, marker = "+", 
      line_kws = {"color": "green"})
      
iris_scatter_species = sns.regplot(x = x, y = y_versicolor, marker = "+", 
      line_kws = {"color": "orange"})

iris_scatter_species

Line of best fit for Model 3 plot

Residuals for Model 3

QQPlot for Model 3

Model Comparison

  • How do we decide which model is better?

  • The main metrics we will discuss in this course are:

    1. Adjusted R^2
    2. AIC
    3. BIC
  • First, we will discuss parsimony and Occam's razor

Occam’s razor: law of parsimony

  • Parsimony (aka Occam's razor):

    • An argument for choosing simpler models over complex ones.
  • Model selection:

    • A complex model must not just be better, but a specified amount better, than a simpler model.
  • Practical considerations:

    • Simple theories are easier to test than complex ones
    • Simple models often do a better job at predicting
    • Simple models require fewer parameters

Occam’s razor: law of parsimony

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk” ~ John von Neumann

Model Comparison: Adjusted \(R^2\)

Adjusted R^2

R-squared is a goodness-of-fit measure for linear regression models:

\(R^2 = \frac{\text{Variance explained by the model}}{\text{Total variance}}\)

It indicates the proportion of the variance in the response variable explained by the explanatory variables. The adjusted \(R^2\) additionally penalises parameters that do not improve the fit, which makes it fairer for comparing models with different numbers of parameters.
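Equivalently, \(R^2 = 1 - \text{RSS}/\text{TSS}\); a quick check with invented response values and hypothetical fitted values:

```python
import numpy as np

# Invented response values and hypothetical fitted values.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))                    # 0.98
```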

Model Comparison: Likelihood

  • The likelihood is the probability of the observed outcome (i.e. the data) given a particular choice of parameters.
  • Maximum likelihood: used to find the set of parameters that make the observed data most likely to have occurred.

Maximum likelihood. Source: Ashan Priyadarshana. https://medium.com/quick-code/maximum-likelihood-estimation-for-regression-65f9c99f815d
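A minimal numerical illustration of the idea (invented data, normal model with known spread): the candidate mean that maximises the likelihood, i.e. minimises the negative log-likelihood, is the sample mean.

```python
import numpy as np

# Five invented measurements.
data = np.array([4.9, 5.1, 5.0, 5.3, 4.7])

def neg_log_likelihood(mu, sigma=1.0):
    # Normal negative log-likelihood, up to an additive constant.
    return 0.5 * np.sum((data - mu) ** 2) / sigma ** 2

# Brute-force search over candidate means.
grid = np.linspace(4.0, 6.0, 2001)
best = grid[np.argmin([neg_log_likelihood(m) for m in grid])]
print(best, data.mean())  # the grid minimiser agrees with the sample mean
```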

Model Comparison: Information Criteria

Information criteria are based on the expected distance between a particular model and the “true” model.

  • All information-theoretic methods involve finding the model that minimizes a criterion that is the sum of a likelihood-based term and a penalty term.
    • the likelihood-based term is often twice the negative log-likelihood.
    • the penalty term varies between information criteria.
  • Selecting models based on information criteria allows for the comparison of all candidate models at once, provided they are fit to the same data.

AIC

The Akaike Information Criterion, or AIC, is the most widespread information criterion, and is defined as

\(\text{AIC} = -2L + 2k\)

  • \(L\) is the log-likelihood

  • \(k\) is the number of parameters in the model.

  • Small values represent better overall fits

  • Adding a parameter with a negligible improvement in fit penalizes the AIC by 2 log-likelihood units.
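The definition can be checked against the Model 1 summary above, which reported a log-likelihood of 24.796 with \(k = 2\) parameters (intercept and slope):

```python
# AIC = -2L + 2k with the Model 1 numbers.
L = 24.796   # log-likelihood from the summary
k = 2        # intercept + petal_length coefficient
aic = -2 * L + 2 * k
print(round(aic, 2))  # -45.59, matching the AIC in the summary table
```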

Some rough guidance for AIC

  • Lower values of AIC indicate a better fit to the data, regardless of whether they are positive or negative.

    • If you have two models with AICs of -213.09 and -289.12, the model with AIC -289.12 is better.
  • AIC comparisons are only valid for models that are fit to the same response data (i.e. same y).

  • For AIC, the rule of thumb people generally follow is that an improvement of greater than 2 means we can select the more complicated model.

BIC

  • The Bayesian information criterion (BIC) is the second most common.
  • It uses a penalty term of \((\log n)k\).
  • The BIC is more conservative than the AIC
    • insists on a greater improvement in fit to accept a more complex model.
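The same check works for BIC, again using the Model 1 numbers (log-likelihood 24.796, \(k = 2\), \(n = 150\)):

```python
import math

# BIC = -2L + (log n) * k with the Model 1 numbers.
L, k, n = 24.796, 2, 150
bic = -2 * L + math.log(n) * k
print(round(bic, 2))  # -39.57, matching the BIC in the summary table

# The per-parameter penalty log(150) is about 5.01, versus AIC's 2:
# this is why BIC demands a bigger improvement before accepting complexity.
print(round(math.log(n), 2))
```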

Approaches to model selection

  • Models with multiple parameters and possible interactions between variables lead to a large number of models to try.

  • Two simple approaches to model selection include:

    • forward selection (add parameters one at a time to the simplest model)
    • backward selection (subtract parameters from the most complex model).
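A forward-selection loop can be sketched in a few lines on invented data; here the score is a Gaussian AIC computed from the residual sum of squares (the data and variable names are hypothetical):

```python
import numpy as np

# Invented data: y depends on features 0 and 2 but not on feature 1.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def aic(features):
    """Gaussian AIC up to a constant: n*log(RSS/n) + 2k."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in features])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Z.shape[1]

# Forward selection: greedily add the feature that lowers AIC the most;
# stop as soon as no addition improves on the current model.
selected, remaining = [], [0, 1, 2]
while remaining:
    scores = {j: aic(selected + [j]) for j in remaining}
    best = min(scores, key=scores.get)
    if scores[best] >= aic(selected):
        break
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # features 0 and 2 (the truly informative ones) are kept
```

Backward selection would run the same loop in reverse: start from all three features and drop the one whose removal lowers AIC the most.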

Approaches to model selection

  • With too large a set of possibilities, you can arrive at different best models depending on whether you use forward or backward selection.
  • Algorithms that do a combination of forward and backward selection exist.
  • We want to be careful that model selection does not become data-dredging.
  • You should:
    • Use common sense and domain knowledge to identify the most important comparisons.
    • Plot the best candidate fits and try to understand why different models fit the data approximately equally well.

Your Turn

  1. Let’s compare the models we ran using Adjusted R^2 and AIC. Using the notes above, discuss in groups which model you think is best, and why.

Adjusted R^2

## Adjusted R^2 Model 1 =  0.9266173379025906 
## Adjusted R^2 Model 2 =  0.944455796412463 
## Adjusted R^2 Model 3 =  0.9458862539383163

AIC

## AIC Model 1 =  -45.59109158022039 
## AIC Model 2 =  -85.40821914792849 
## AIC Model 3 =  -87.39085622323768

  2. Take a look at the main parameters of Model 2 and Model 3 from the model summary tables. Do they seem to vary much between the models?

Your Turn

Make a new hypothesis from the iris data and take it through the modelling process.